# Cross-modal reasoning

Gemma 3n E4B It

Gemma 3n is a family of lightweight, open-weight multimodal models released by Google. It is built on the same research and technology as the Gemini models and accepts text, audio, and visual inputs.

Qwen2 VL 2B Instruct

Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports image-text-to-text tasks.

Transformers English

Aya Vision 32B

Aya Vision 32B is an open-weight, 32B-parameter multimodal model developed by Cohere Labs that supports vision-language tasks in 23 languages.

Transformers Supports Multiple Languages

Eilev Blip2 Opt 2.7b

A vision-language model optimized for first-person perspective input. It is built on BLIP-2-OPT-2.7B and trained with the EILEV method to elicit in-context learning.

Transformers English

Layoutlmv3 Base Mpdocvqa

A document visual question answering model based on Microsoft's pre-trained LayoutLMv3 and fine-tuned on the Multi-page Document VQA (MP-DocVQA) dataset.

Transformers English
